Our dataset features 81,013 observations of 31 variables (shown below) which are related to aviation accidents recorded. Each accident is provided an “Event.Id”. We can also see that variables provided information on a variety of topics including the date and location of the accident, the model and type aircraft, information on the sustained injuries to passengers and to the aircraft, and the reported weather conditions at the time.
## [1] 81013 31
## [1] "Event.Id" "Investigation.Type"
## [3] "Accident.Number" "Event.Date"
## [5] "Location" "Country"
## [7] "Latitude" "Longitude"
## [9] "Airport.Code" "Airport.Name"
## [11] "Injury.Severity" "Aircraft.Damage"
## [13] "Aircraft.Category" "Registration.Number"
## [15] "Make" "Model"
## [17] "Amateur.Built" "Number.of.Engines"
## [19] "Engine.Type" "FAR.Description"
## [21] "Schedule" "Purpose.of.Flight"
## [23] "Air.Carrier" "Total.Fatal.Injuries"
## [25] "Total.Serious.Injuries" "Total.Minor.Injuries"
## [27] "Total.Uninjured" "Weather.Condition"
## [29] "Broad.Phase.of.Flight" "Report.Status"
## [31] "Publication.Date"
Let’s have a look at the first 5 rows of the dataset for an initial look at how the dataframe is organised.
## Event.Id Investigation.Type Accident.Number Event.Date
## 1 20171230X01728 Accident WPR18LA058 2017-12-29
## 2 20171229X23639 Accident GAA18CA094 2017-12-28
## 3 20180102X10253 Accident CEN18LA064 2017-12-28
## 4 20171227X12320 Accident GAA18CA096 2017-12-27
## 5 20171228X35517 Accident CEN18LA062 2017-12-27
## Location Country Latitude Longitude Airport.Code
## 1 San Bernardino, CA United States 34.09167 -117.24778 SBD
## 2 Spearfish, SD United States 44.48472 -103.78805 SPF
## 3 Paulding, OH United States 41.18889 -84.70806
## 4 Upper Lake, CA United States 39.44306 -122.96028 1Q5
## 5 Michigan City, IN United States 41.69528 86.82556 MGC
## Airport.Name Injury.Severity Aircraft.Damage
## 1 SAN BERNARDINO INTL Non-Fatal Substantial
## 2 BLACK HILLS-CLYDE ICE FIELD Non-Fatal Substantial
## 3 N/A Non-Fatal Substantial
## 4 Unavailable
## 5 Michigan City Municipal Airpor Non-Fatal Substantial
## Aircraft.Category Registration.Number Make
## 1 Airplane N222AT CESSNA
## 2 Airplane N11KX PIPER
## 3 Airplane N4JW PIPISTREL DOO AJDOVSCINA
## 4 Airplane N3945A AEROPRO CZ
## 5 Airplane N525KT CESSNA
## Model Amateur.Built Number.of.Engines Engine.Type
## 1 210 No 1 Reciprocating
## 2 PA 32-301T No 1
## 3 VIRUS SW 1 Reciprocating
## 4 EUROFOX No 1
## 5 525A No NA
## FAR.Description Schedule Purpose.of.Flight Air.Carrier
## 1 Part 91: General Aviation Personal
## 2 Part 91: General Aviation Personal
## 3 Part 91: General Aviation Personal
## 4 Part 91: General Aviation Personal
## 5 Part 91: General Aviation Positioning
## Total.Fatal.Injuries Total.Serious.Injuries Total.Minor.Injuries
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## Total.Uninjured Weather.Condition Broad.Phase.of.Flight Report.Status
## 1 1 VMC LANDING Preliminary
## 2 1 VMC Preliminary
## 3 1 VMC CRUISE Preliminary
## 4 NA Preliminary
## 5 2 VMC LANDING Preliminary
## Publication.Date
## 1 2018-01-08
## 2 2018-01-09
## 3 2018-01-04
## 4 2018-01-08
## 5 2018-01-08
The structure of the dataframe (shown below) shows that there is a total of 81,013 observations of 31 variables in this dataset. The variable types include factorial, numeric, and integer variables.
## 'data.frame': 81013 obs. of 31 variables:
## $ Event.Id : Factor w/ 79835 levels "20001204X00000",..: 79832 79831 79833 79828 79830 79829 79827 79824 79826 79823 ...
## $ Investigation.Type : Factor w/ 2 levels "Accident","Incident": 1 1 1 1 1 1 1 1 1 1 ...
## $ Accident.Number : Factor w/ 81013 levels "ANC00FA018","ANC00FA024",..: 81010 45832 16696 45833 16695 80965 81009 37058 37044 16665 ...
## $ Event.Date : Factor w/ 13005 levels "1948-10-24","1962-07-19",..: 13005 13004 13004 13003 13003 13002 13001 13000 13000 13000 ...
## $ Location : Factor w/ 25708 levels "",", Gabon",", Ireland",..: 20287 21673 17753 23609 14620 20476 17932 18859 23566 1488 ...
## $ Country : Factor w/ 178 levels "","Afghanistan",..: 169 169 169 169 169 169 169 164 169 169 ...
## $ Latitude : num 34.1 44.5 41.2 39.4 41.7 ...
## $ Longitude : num -117.2 -103.8 -84.7 -123 86.8 ...
## $ Airport.Code : Factor w/ 9788 levels "","-","---",".",..: 8485 8733 1 666 6879 8701 1 1 1 3498 ...
## $ Airport.Name : Factor w/ 23147 levels "","---","??vora Airport",..: 18217 1874 14356 1 13525 10323 14356 14356 14356 1320 ...
## $ Injury.Severity : Factor w/ 125 levels "Fatal(1)","Fatal(10)",..: 124 124 124 125 124 125 124 49 124 88 ...
## $ Aircraft.Damage : Factor w/ 4 levels "","Destroyed",..: 4 4 4 1 4 4 4 2 4 2 ...
## $ Aircraft.Category : Factor w/ 14 levels "","Airplane",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Registration.Number : Factor w/ 70233 levels "","00SLB","1227G",..: 11326 2645 32919 24570 35350 47841 45460 1 64577 13638 ...
## $ Make : Factor w/ 7728 levels "","107.5 Flying Corporation",..: 1303 5386 5405 103 1303 1303 1303 1303 1303 1303 ...
## $ Model : Factor w/ 11599 levels "","-269C","-737-222",..: 616 7912 11241 5090 1225 378 194 254 254 912 ...
## $ Amateur.Built : Factor w/ 3 levels "","No","Yes": 2 2 1 2 2 2 2 2 2 2 ...
## $ Number.of.Engines : int 1 1 1 1 NA 1 1 1 1 2 ...
## $ Engine.Type : Factor w/ 14 levels "","Electric",..: 8 1 8 1 1 1 8 8 8 8 ...
## $ FAR.Description : Factor w/ 19 levels "","Armed Forces",..: 15 15 15 15 15 15 15 4 15 15 ...
## $ Schedule : Factor w/ 4 levels "","NSCH","SCHD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Purpose.of.Flight : Factor w/ 23 levels "","Aerial Application",..: 16 16 16 16 17 16 16 23 16 16 ...
## $ Air.Carrier : Factor w/ 2930 levels "","(DBA: [EMS])",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Total.Fatal.Injuries : int NA NA NA NA NA NA NA 2 NA 5 ...
## $ Total.Serious.Injuries: int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Minor.Injuries : int NA NA NA NA NA NA NA NA 1 NA ...
## $ Total.Uninjured : int 1 1 1 NA 2 NA 2 NA NA NA ...
## $ Weather.Condition : Factor w/ 4 levels "","IMC","UNK",..: 4 4 4 1 4 1 4 1 4 2 ...
## $ Broad.Phase.of.Flight : Factor w/ 13 levels "","APPROACH",..: 7 1 4 1 7 1 1 11 8 11 ...
## $ Report.Status : Factor w/ 4 levels "Factual","Foreign",..: 3 3 3 3 3 3 3 2 3 3 ...
## $ Publication.Date : Factor w/ 3709 levels "","1980-04-16",..: 3705 3706 3704 3705 3705 3703 3703 1 3705 3702 ...
Since this a United States database from the United States, it is highly likely that the majority of the reports accidents are from the US. Let’s check that calculating the percentage of reported accidents in this database from the US.
## [1] 0.9404417 1.0000000
As expected, the majority of accidents (over 94%) in this databased are from the US. Hence, we will be focusing on the accidents that took place in the US in this analysis.
As shown below, the new dataframe now features 76,188 observations.
## [1] 76188 31
We will further tailor our dataframe for this analysis by focusing on 7 variables of interest, out of the 31 available variables. Let’s focus on: (1) the date, (2) the total fatal injuries, (3) the engine type, (4) the weather condition, (5) the broad phase of flight, (6) Latitude, and (7) Longitude
Let’s check that our new dataframe ‘pf_study’ features all 5 variables.
## [1] "Event.Date" "Total.Fatal.Injuries" "Engine.Type"
## [4] "Weather.Condition" "Broad.Phase.of.Flight" "Latitude"
## [7] "Longitude"
## [1] 76188 7
As shown above, the dataframe we will be using for this study features 76,188 observations of 5 variables. Below is a summary of the dataframe.
Looking at the summary output we can get a quick overview of the distribution for each of our variables.
## Event.Date Total.Fatal.Injuries Engine.Type
## 1982-05-16: 25 Min. : 0.000 Reciprocating:64835
## 1984-06-30: 25 1st Qu.: 0.000 Turbo Shaft : 3127
## 1983-06-05: 24 Median : 0.000 Turbo Prop : 2974
## 1983-08-05: 24 Mean : 0.535 Turbo Fan : 1967
## 1984-08-25: 24 3rd Qu.: 1.000 Unknown : 1387
## 1986-05-17: 24 Max. :265.000 : 1227
## (Other) :76042 NA's :22821 (Other) : 671
## Weather.Condition Broad.Phase.of.Flight Latitude
## : 303 LANDING :19513 Min. :-61.88
## IMC: 5369 TAKEOFF :15370 1st Qu.: 33.69
## UNK: 597 CRUISE :10548 Median : 38.35
## VMC:69919 MANEUVERING: 9880 Mean : 38.96
## APPROACH : 7789 3rd Qu.: 42.48
## : 2736 Max. : 87.15
## (Other) :10352 NA's :50372
## Longitude
## Min. :-175.37
## 1st Qu.:-115.99
## Median : -95.66
## Mean : -98.19
## 3rd Qu.: -82.62
## Max. : 174.33
## NA's :50382
The variable related to the observation’s event date will need some further data wrangling to be able to extract meanigful information to draw conclusions from. For instance, it would be useful to break the observation’s event date by day, month, and year to investigate if there are any correlations between number of accidents/incidents and particular periods within a year.
Also, looking at the Total Fatalities, we can see that the 3rd Qu. is 1.000, the Max is 265. Clearly, this distribution will be heavily skewed. A better way of displaying this data is by grouping number of fatalities in buckets, as shown below. This will give us a better reprsentation of the distribution of fatalities across the observations in the dataset.
##
## [0,1] (1,2] (2,3] (3,4] (4,5] (5,10] (10,20]
## 46666 4136 1227 822 215 232 33
## (20,50] (50,100] (100,150] (150,200] (200,250] (250,300] <NA>
## 17 10 4 2 2 1 22821
In terms of the weather conditions, let’s replace blank entries in the weather conditions with “unknown”.
Similarly, let’s replace blank entries in the Phases of Flight with “unknown”.
In this section, we will create univariate plots for the 5 variables of interest.
Let’s have an initial look at the ‘Event.Date’ variable.
## Min. 1st Qu. Median Mean 3rd Qu.
## "1948-10-24" "1987-11-30" "1995-12-17" "1997-03-06" "2005-08-05"
## Max.
## "2017-12-29"
Now, let’s plots frequency histograms for the year, month, and weekday of accidents/incidents in the dataset.
It seems the majority of the observations in the dataframe are from after the early 1980s onwards. Let’s perform a quick check.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1948 1987 1995 1997 2005 2017
Let’s generate a plot from 1980 to 2017.
According to our plot, the number of accidents/incidents has substantially decreseased between 1982 and 2017 from approx. 3400 observations to approx. 1600 observations.
Next, let’s look at observations distribution by months of the year.
According to our plot, the highest frequency of accidents/incidents in a given year take place during northern hemisphere summer time (Jun-Jul-Aug).
And finally, let’s look at the observations distrubution by day of the week.
According to our plot, the highest frequency of accidents/incidents in a given week take place during the weekend (Sat-Sun).
Our next variable of interest relates to the Total Fatal Injuries for each observation in the dataset.This is quantified by the number of people injured for each recorded observation.
Let’s first have a look at the different levels in the factorial variable describing the severity of injury in our dataframe.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.535 1.000 265.000 22821
Let’s generate a plot.
We can see that the majority of recorded accidents/incidents have a low number of fatalities while some outliers are displaying large numbers of fatalities. A line plot makes that even clearer.
We’ve grouped the number of fatalities in the buckets shown in the plot below.
As previously stated, the charts above show that the bulk of accidents resulting in fatalities of less than 20.
Next, we look at the engine types recorded in our observations. Let’s have a look at the different engin types distribution in our dataframe.
## [1] "" "Electric" "Hybrid Rocket"
## [4] "None" "REC, ELEC" "REC, TJ, REC, TJ"
## [7] "REC, TJ, TJ" "Reciprocating" "TJ, REC, REC, TJ"
## [10] "Turbo Fan" "Turbo Jet" "Turbo Prop"
## [13] "Turbo Shaft" "Unknown"
## Electric Hybrid Rocket None
## 1227 4 1 13
## REC, ELEC REC, TJ, REC, TJ REC, TJ, TJ Reciprocating
## 1 1 2 64835
## TJ, REC, REC, TJ Turbo Fan Turbo Jet Turbo Prop
## 1 1967 648 2974
## Turbo Shaft Unknown
## 3127 1387
Let’s generate a plot. We’ve abbreviate engine type names to improve labelling of the x-axis.
A logarithmic scale would be suited also.
According to our charts, the bulk of engine types in the reported accidents/incidents are Reciprocal engine types.
Next, we look at the weather conditions recorded in our observations. Let’s have a look at the different weather conditions in our dataframe.
## VMC IMC UNK NA's
## 69919 5369 597 303
## VMC IMC UNK NA's
## 69919 5369 597 303
VMC stands for visual meteorological conditions and IMC stands for instrument meteorological conditions. VMC means that conditions are such that pilots have sufficient visibility to fly the aircraft maintaining visual separation from terrain and other aircraft.
IMC means weather conditions require pilots to fly primarily by reference to instruments.
Let’s generate a plot of the weather conditions distribution in our dataframe.
According to our plot, the bulk of accidents/incidents take place during VMC weather conditions.
Next, we look at the weather conditions recorded in our observations. Let’s have a look at the different weather conditions in our dataframe.
## Length Class Mode
## 76188 character character
According to our plot, the bulk of accidents/incidents take place during landing or take-off.
Next, we look at the latitudes recorded in our observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -61.88 33.69 38.35 38.96 42.48 87.15 50372
Next, we look at the latitudes recorded in our observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -175.37 -115.99 -95.66 -98.19 -82.62 174.33 50382
According to our plots, the bulk of accidents/incidents are taking place within a range of latitude of [20,50] and longitude of [-170,-70] degrees.
The initial dataset featured 81,013 observations of 31 variables related to aviation accidents recorded. From this dataframe, a dataframe for this study was extracted. It features 76,188 observations of 5 variables of interest to us: (1) Event Date, (2) Injury Severity, (3) Engine Type, (4) Weather Condition, and (5) Broad Phase of Flight when the accident or incident occured.
The main features in the data set are the number of fatalities and the other variables in the dataset. I’d like to determine which features are best for predicting the number of fatalities for a given observation of an accident or incident. I suspect that it will be some combination of variables that will allow us to build a predictive model for the amount of fatalities.
Yes, I had to create new variables to extract the year, month, and days of the observations. I also created a new variable to cluster fatalities numbers into buckets.
The dataset, even though provided by Kaggle, required a fair amount of data wrangling to make the best use of it. Nothing was unusual but I certainly appreciate the amount of time needed for data wrangling before being able to perfor an EDA on the dataset afer working on this dataset.
Let’s look at the relationship between pairs of variables that could show interesting relationship. First a recap of available variables in our dataframe.
## [1] "Event.Date" "Total.Fatal.Injuries"
## [3] "Engine.Type" "Weather.Condition"
## [5] "Broad.Phase.of.Flight" "Latitude"
## [7] "Longitude" "Event.Year"
## [9] "Event.Month" "Event.Weekday"
## [11] "Total.Fatal.Injuries.Brackets"
Let’s look at the relationship of Engine Types and Total Fatal Injuries.
We can see that given that the bulk of the distribution has a total fatal injuries under 20, let’s zoom in on that portion of the data.
According to the plot, the bulk of the data for fatalities under 20 is with the engine type Reciprocated. The first plot shows that the Tubo-Fan engine has more outliers with higher number of fatalities than other engins. This is likely due to the use of Turbo-Fan engines use on some large commercial aircraft.
Let’s look at the relationship of Weather Conditions and Total Fatal Injuries.
Weather conditions do not show a particularly strong relationship with total fatal injuries. The bulk of the distribution is associate with VMC weather conditions. However, that is likely to the fact that the vast majority of flights are flown in VMC conditions.
Let’s look at the relationship of Phase of Flight and Total Fatal Injuries
The plots show that Take-Off and Apprach are associated with outliers with high number of fatalities. These two phases of flight are often referred to as “critical phases of flight” for that particular reason. They’re high risk.
This boxplot further emphasises the presence of high outliers for take-off and approaches phases of flight. It also indicates that outliers are present for all broad phases of flight.
Let’s look at the relationship of Event Date and Total Fatal Injuries
The distribution is fairly distributed across all months of the year.
The box plot shows also outliers distributed fairly across all months of the year. However, norther winter months show higher fatalities for the upper bound of the interquartile range compared to the northern summer months.
The distribution looks to be fairly distributed across all days of the week.
Again, the box plot shows that the distribution is fairly distributed across all days of the week.
Both latitude and longitude plots are biased toward the United States coordinates due to our initial filtering of the data to focus on US observations.
The plot indicates that there is a higher frequency of recorded observations for IMC flying conditions for “cruise” and the “approach” compared to the rest of the phases of flight flown in IMC. In VMC conditions, the distribution looks fairly evenly distributed across all phases of flight.
The plot indicates that there is a higher frequency of recorded observations for nothern summer months during Landing and Take-off. Across all months, the heat map also shows that the Take-off and Landing register the highest number of observations.
These plots are essently maps of the US and clearly shows that the coastal states are more heavily impacted that the Mid-western states and most of Alaska. This can be explained by the volume of flight to/from destinations across the US.
## $estimate
## cor
## 0.001698444
## $estimate
## cor
## -0.02616494
some outliers particularly for the engine type TrbF, which is the Turbo-Fan engine type. Reciprocal, Turbo-Prop and Turbo-Shaft show the largest number of fatalities.
Regarding of the Weather Conditions and Total Fatal Injuries, the VMC and IMC conditions show the highest numbers of fatalities recorded. However, that may be insignificant as commercial aircraft can typically only fly if VMC or IMC conditions are present.
Regarding the relationship Phase of Flight and Total Fatal Injuries, there doesn’t seem to be any particular one phase of flight that jumps out. However, a set of flight phases show the highest numbers of fatalities, namely the approach and take-off.
Regarding the relationship of Event Date and Total Fatal Injuries, outliers with high number of fatalities are more present during northern summer. For the bulk of the distribution, no particular month shows a higher number of fatalities. Similarly for the Weekday, no particular Weekday shows a higher number of fatalities.
Regarding the relationship between Longitude/Latitude and Total Fatal Injuries, we can see that the incidents are clustered within certain ranges. This is the result of our selection to focus only on observations noting “United States” as Country. The plot of Latitude vs Longitude clearly shows the bulk of the observations being in the US. The correlation factors between longitude and latitude and total number of fatalities is a weak positive relationship.
accidents/incidents was interesting. It seems there is higher frequency of recorded observations for certain combinations of weather and what we could qualify higher risk phases of flight. For instance, IMC flying conditions while during “cruise” and the “approach”.
The stronger relationship I found was between engine types and total number of fatalities; and between weather condition and flight phase of recorded accidents/incidents in the dataframe.
First, let’s look at Latitude vs Longitude of observations by Month of the Year.
The distribution of month across latitude and longitude is fairly spread across the US, with a slightly higher prevalence of observations during the winter in southern states like Florida.
Let’s now look at latitude vs longitude add layer for the weather condition.
Weather condition VMC seems to be quite consistent except for patches of primarily IMC conditions for certain discrete areas.
Let’s now look at the relationship of Broad Phase of Flight vs Month by Weather Conditions.
When looking at the relationship between Broad Phase of Flight vs Month and Weather Condition, we can see that accident/incidents primarily take place during VMC weather condition. However, for certain months of the year such as December an January, IMC conditions are a non-negligible portion of the observations, particularly during Approach and Cruise phases of flight.
Next, let’s look at Total Fatal Injuries vs Broad Phases of Flight by Weather Condition.
In terms of the relationship between Total Fatal Injuries vs Broad Phases of Flight and Weather Condition, the scatter plot clearly indicates that the total number of fatalies shows an increased number with IMC as weather condition and during Cruise and Approach phases of flight. Furthermore, we can see that in general, low fatality count (less then 2) primarily took place in VMC weather conditions.
Finally, let’s look at the relationship between Total Fatal Injuries vs Engine Type by Year.
The relationship between Total Fatal Injuries vs Engine Type by Year indicates that certain engines have become prevaent in different time periods, and hence were more likely to be in use during accidents or incidents.
Let’s build a linear model using the variables in our “pf_study” dataframe.
##
## Calls:
## m1: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type), data = pf_study)
## m2: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition,
## data = pf_study)
## m3: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition +
## Broad.Phase.of.Flight, data = pf_study)
## m4: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition +
## Broad.Phase.of.Flight + Latitude, data = pf_study)
## m5: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition +
## Broad.Phase.of.Flight + Latitude + Event.Year, data = pf_study)
## m6: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition +
## Broad.Phase.of.Flight + Latitude + Event.Month, data = pf_study)
## m7: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition +
## Broad.Phase.of.Flight + Latitude + Event.Weekday, data = pf_study)
##
## =======================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.873*** 1.232*** 1.254*** 1.442*** -4.185 1.405*** 1.347***
## (0.207) (0.209) (0.223) (0.161) (8.589) (0.177) (0.171)
## I(Engine.Type): Hybrid Rocket -0.873 -0.232 -0.455 0.377 0.368 0.431 0.407
## (2.788) (2.713) (2.724) (1.441) (1.441) (1.441) (1.439)
## I(Engine.Type): REC, TJ, REC, TJ 1.127 0.732
## (2.788) (2.713)
## I(Engine.Type): REC, TJ, TJ 0.127 0.250 0.113 0.477 0.481 0.532 0.418
## (1.977) (1.924) (1.924) (0.966) (0.966) (0.966) (0.965)
## I(Engine.Type): Reciprocating -1.407*** -0.861*** -0.708** 0.209 0.213 0.210 0.221
## (0.207) (0.209) (0.221) (0.121) (0.121) (0.121) (0.121)
## I(Engine.Type): TJ, REC, REC, TJ 1.127 1.768 1.908 1.637 1.643 1.681 1.727
## (2.788) (2.713) (2.712) (1.360) (1.360) (1.361) (1.359)
## I(Engine.Type): Turbo Fan 0.472* 0.781*** 1.018*** 0.666** 0.673*** 0.667** 0.710***
## (0.223) (0.224) (0.236) (0.203) (0.204) (0.203) (0.203)
## I(Engine.Type): Turbo Jet -1.196*** -0.740** -0.528* 0.260 0.268 0.247 0.286
## (0.243) (0.243) (0.254) (0.231) (0.231) (0.231) (0.231)
## I(Engine.Type): Turbo Prop -0.749*** -0.329 -0.186 0.793*** 0.796*** 0.786*** 0.826***
## (0.216) (0.218) (0.230) (0.142) (0.142) (0.142) (0.142)
## I(Engine.Type): Turbo Shaft -1.219*** -0.686** -0.658** 0.505*** 0.510*** 0.499*** 0.540***
## (0.216) (0.217) (0.229) (0.148) (0.148) (0.148) (0.148)
## I(Engine.Type): Unknown -1.711*** -1.092*** -0.907*** -0.490 -0.490 -0.431 -0.460
## (0.220) (0.221) (0.234) (0.801) (0.801) (0.803) (0.801)
## Weather.Condition: IMC/VMC 1.035*** 0.918*** 0.389*** 0.391*** 0.383*** 0.394***
## (0.041) (0.043) (0.055) (0.056) (0.056) (0.055)
## Weather.Condition: UNK/VMC 0.734*** 0.629*** -0.107 -0.128 -0.103 -0.087
## (0.115) (0.120) (0.410) (0.411) (0.410) (0.410)
## Broad.Phase.of.Flight: CLIMB/APPROACH 0.208** -0.038 -0.036 -0.032 -0.032
## (0.073) (0.111) (0.111) (0.111) (0.111)
## Broad.Phase.of.Flight: CRUISE/APPROACH -0.079 -0.014 -0.007 -0.019 -0.014
## (0.046) (0.073) (0.073) (0.073) (0.072)
## Broad.Phase.of.Flight: DESCENT/APPROACH -0.155* -0.187 -0.187 -0.194 -0.184
## (0.075) (0.108) (0.108) (0.108) (0.108)
## Broad.Phase.of.Flight: GO-AROUND/APPROACH -0.108 0.077 0.075 0.074 0.075
## (0.086) (0.129) (0.129) (0.129) (0.129)
## Broad.Phase.of.Flight: LANDING/APPROACH -0.563*** -0.683*** -0.681*** -0.688*** -0.682***
## (0.044) (0.114) (0.114) (0.114) (0.114)
## Broad.Phase.of.Flight: MANEUVERING/APPROACH 0.174*** -0.110 -0.109 -0.114 -0.111
## (0.047) (0.064) (0.064) (0.064) (0.064)
## Broad.Phase.of.Flight: OTHER/APPROACH 0.201 -0.815 -0.819 -0.813 -0.823
## (0.266) (0.482) (0.482) (0.482) (0.481)
## Broad.Phase.of.Flight: STANDING/APPROACH -0.542*** -0.661** -0.662** -0.678** -0.640**
## (0.107) (0.210) (0.210) (0.210) (0.210)
## Broad.Phase.of.Flight: TAKEOFF/APPROACH -0.162*** -0.074 -0.076 -0.077 -0.079
## (0.044) (0.069) (0.069) (0.069) (0.069)
## Broad.Phase.of.Flight: TAXI/APPROACH -0.693*** -0.689* -0.683* -0.746* -0.691*
## (0.078) (0.346) (0.346) (0.347) (0.346)
## Broad.Phase.of.Flight: UNKNOWN/APPROACH 0.606*** -0.100 -0.098 -0.105 -0.113
## (0.125) (0.158) (0.158) (0.158) (0.158)
## Latitude -0.000 -0.000 -0.000 0.000
## (0.003) (0.003) (0.003) (0.003)
## Event.Year 0.003
## (0.004)
## Event.Month: Feb/Jan 0.312**
## (0.115)
## Event.Month: Mar/Jan 0.067
## (0.113)
## Event.Month: Apr/Jan -0.085
## (0.109)
## Event.Month: May/Jan 0.016
## (0.107)
## Event.Month: Jun/Jan -0.009
## (0.103)
## Event.Month: Jul/Jan 0.095
## (0.101)
## Event.Month: Aug/Jan 0.090
## (0.101)
## Event.Month: Sep/Jan -0.009
## (0.105)
## Event.Month: Oct/Jan -0.022
## (0.106)
## Event.Month: Nov/Jan -0.014
## (0.108)
## Event.Month: Dec/Jan 0.047
## (0.107)
## Event.Weekday: Tue/Mon -0.098
## (0.084)
## Event.Weekday: Wed/Mon 0.010
## (0.081)
## Event.Weekday: Thu/Mon 0.093
## (0.080)
## Event.Weekday: Fri/Mon 0.065
## (0.078)
## Event.Weekday: Sat/Mon 0.107
## (0.075)
## Event.Weekday: Sun/Mon 0.226**
## (0.076)
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.012 0.023 0.033 0.045 0.045 0.049 0.049
## adj. R-squared 0.012 0.023 0.032 0.040 0.040 0.042 0.043
## sigma 2.781 2.705 2.703 1.354 1.355 1.353 1.352
## F 65.785 103.844 80.980 9.154 8.789 6.728 8.005
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -130299.604 -128676.091 -127081.722 -7726.351 -7726.136 -7717.351 -7715.844
## Deviance 412595.700 389914.575 384537.871 8198.306 8197.518 8165.527 8160.051
## AIC 260623.208 257380.181 254211.444 15502.703 15504.271 15506.703 15493.689
## BIC 260729.828 257504.555 254424.365 15662.960 15670.938 15737.473 15692.407
## N 53367 53307 52666 4493 4493 4493 4493
## =======================================================================================================================================================
the US, although we do notice a higher prevalence of oberservations during the winter in southern states like Florida, and in summer in northern states like Alaska. This likely correspond to higher volumes of flights to/from these destinations in those periods of the year. In terms of latitude vs longitude by weather condition, The weather condition VMC seems to be quite consistent except for patches of primarily IMC conditions from certain discrete areas.
The relationship between Total Fatal Injuries vs Engine Type by Year was interesting as it also how certain engines have become prevaent in different time periods. For instance, Turbo-Jet and Turbo-Fan powered aircraft show a higher number of fatalities in later years whereas Reciprocated engies show a distribution of fatalities in earlier years.
A liner model was created but only showed a low R-squared value of 0.049. Hence, the variables in the linear model account for 4.9% of the variance in the total number of fatalities.
The relationship between Total Fatal Injuries vs Engine Type by Year was interesting as it also how certain engines have become prevaent in different time periods.
This is a great plot as the observations in our dataset clearly cover the entirety of the US space, hence outlining map of the US one oberservation at the time. A sad chart however as it shows that every state suffered an aviation tragedy between 1948 and 2017.
This is a great plot as it gives an indication of which phases of flight to be particularly worried about given a particular weather condition. For instance, if you know that the weather condition is VMC, watch out for the Take-off and Landing! ——
I was particularly suprised by two three: (1) the amount of time it took to read about and understand each variable to make the best use of the given data; (2) the amount of computational power it took to process a dataset of a little less then 80,000 observations - a small sample size in our day & age; (3) and finally how much data wrangling it took to prep the data despite the fact that this is considered “clean data” after Kaggle already did the tidying.
I mostly got around the challenges I found while doing the analysis by referring to my classe notes and reading StackOverflow question anwers. Also, using an R Markdown file also quick trials & erros which is very useful when exploring.
The info from this study could potentially inform pilots to pay particular attention for certain combination of weather conditions and phases of flight. As demonstrated by the liner model outcomes, it is quite difficult to predict number of fatalities but more variables could have potentially increase the R-squared factor of the predictive linear model shown in this study.
A great project nevertheless. I really enjoyed it.